Spatial homogeneity pursuit of regression coefficients for hand, foot and mouth disease in Xinjiang Uygur Autonomous Region in 2018

To explore the complex spatial relationship between the incidence of hand, foot, and mouth disease (HFMD) and meteorological factors [average temperature (AT), average relative humidity (ARH), average air pressure (AP), and average wind speed (AW)], this paper constructed a Spatial Clustering Coefficient (SCC) regression model to detect the spatial clustering patterns of each regression coefficient in different seasons. The results revealed that, compared with geographically weighted regression (GWR), the coefficients estimated by the SCC method were smoother, with clearly identified spatial patterns and mitigated edge effects, so interesting spatial patterns were easy to identify in the SCC-estimated coefficients. The SCC method also estimated the relationship between the meteorological factors and HFMD cases more accurately. The significance of each meteorological factor's effect on HFMD incidence depended on the season. Specifically, AT was negatively correlated with HFMD in summer and winter, especially in the Altay region, Bayingoleng Mongolian Autonomous Prefecture, Turpan region and Hami region. AW had a positive effect on HFMD in summer but a negative effect across the whole of Xinjiang in winter. In summer, ARH showed a strong negative correlation in Tianshan district, Shayibak district, Shuimogou district, etc., but a high positive correlation in Alar city; in winter, ARH showed a high negative correlation in the Altay region, Aksu region and other places, and a strong positive correlation in Shayibak district. Finally, AP had a strong positive correlation with HFMD in summer in Shayibak district, but in winter it showed a strong negative correlation in Altay district and Buxel Mongolia Autonomous county.
In summary, Xinjiang should adapt measures to local conditions, and formulate appropriate HFMD prevention strategies according to the characteristics of different regions, time, and meteorological factors.

Hand, foot and mouth disease (HFMD) is an infectious viral (enterovirus) disease that occurs worldwide and is frequent in children under the age of 5. The Ministry of Health of China designated HFMD as a class "C" infectious disease in May 2008. In recent years, it has occurred all over China throughout the year. The threat of HFMD to public health has prompted scholars to study the characteristics and causes of the disease, and many researchers have focused on the impact of meteorological factors on HFMD. Studies have shown that meteorological factors such as average temperature and average relative humidity affect the incidence of HFMD [1][2][3][4][5]. Ma et al. found that the HFMD visit rate in Hong Kong was positively correlated with the average temperature, daily temperature difference, relative humidity, and wind speed [6]. Time series analyses in Guangzhou and Shenzhen revealed that temperature was positively correlated with the incidence of HFMD [7,8]. In contrast, a Japanese study found that the number of days per week with an average temperature exceeding 25 °C was negatively correlated with the incidence of HFMD [9]. However, the above studies did not consider spatial differences in the incidence of HFMD. The GWR model, as a spatial statistical method, can effectively model and analyze spatial heterogeneity. Hu et al. used a GWR model to investigate the potential factors affecting the incidence of HFMD in children [10]. Similarly, Hong et al. used a GWR model to discuss the relationship between HFMD and meteorological factors in Inner Mongolia, and they discovered that these factors were correlated with HFMD and that each factor had its own spatial heterogeneity [11,12], as also demonstrated by Wang et al. and Bo et al. [13].
GWR models tend to estimate regression coefficients as continuous functions. However, for data collected over a large area, the relationship between the response variable and the predictors may exhibit complex spatial patterns, and the variation of the regression coefficients is not necessarily continuous. In particular, relationships between spatial variables may change abruptly at the boundaries of adjacent clusters but remain relatively homogeneous within clusters [14][15][16]. Research on the homogeneity pursuit of regression coefficients in high-dimensional data analysis helps to solve this problem. In these studies [17,18], pairwise coefficient differences are penalized to encourage homogeneity among coefficients. Building on these ideas, Li et al. proposed Spatial Clustering Coefficient (SCC) regression to detect the spatial clustering pattern of regression coefficients. The method integrates spatial domain information by constructing an appropriate regularization that automatically detects abrupt change points in the regression coefficients, and, owing to its relatively strong local adaptability, it can also estimate continuously varying regression coefficients [19]. Therefore, this paper aims to detect and analyze the spatial clustering patterns of the effects of AT, MT, ARH, AP, and AW on HFMD occurrence in Xinjiang in 2018 using the SCC model at the county level and monthly scale.

Results
Statistical analysis. From January 1 to December 31, 2018, the total number of HFMD cases in 100 counties in Xinjiang was 10,260. The monthly distribution of HFMD cases in Xinjiang is shown in Fig. 1. The main peak of HFMD occurred in May-July, and a second peak occurred in September-November. Therefore, May-December is chosen as the main research period.
The distribution of HFMD cases in Xinjiang from May 1 to December 31, 2018 is shown in Fig. 2. The incidence of HFMD in northern Xinjiang is higher than that in southern Xinjiang, and the high-incidence areas include Changji city, Xinshi district, Shayibak district, Shuimogou district, Midong district, Tianshan district, Shihezi city, Yining city, Jinghe county, Karamay district and Yizhou district.
It can be clearly seen from Fig. 3 that the temperature distributions in the Altay region and Tacheng region are similar, and that those in Ili Kazakh Autonomous Prefecture, Aksu, Kizilsu Kirgiz Autonomous Prefecture, Kashi region and Hotan region are the same. For ARH, the Altay region forms one category, Ili Kazakh Autonomous Prefecture forms another, Aksu region and Kizilsu Kirgiz Autonomous Prefecture are clustered into a third, and Kashi region, Hotan region and part of Bayingoleng Mongolian Autonomous Prefecture form a fourth. The spatial clustering of AP is similar to that of ARH, and the distribution of AW is similar to that of AT. The specific spatial clustering results are shown in Fig. 3.

Correlation analysis of meteorological factors.
Spearman correlation analysis is used to explore the relationship between the incidence of HFMD (NUM) and meteorological factors in Xinjiang. The detailed results are shown in Table 1. The Spearman correlation coefficient between AT and MT is 0.968, indicating that the two variables are highly correlated. Therefore, the meteorological factors other than MT are selected as explanatory variables when establishing the SCC model.
Spatial cluster analysis of regression coefficient. In order to compare the spatial clustering patterns of the regression coefficients in different time periods, this paper mainly analyzes the data of June and November, because the statistical analysis shows that the number of HFMD cases is higher in these two months. The partial coefficient estimates obtained by the SCC method are shown in Figs. 4 and 5, together with the coefficient estimates obtained by the GWR method for comparison and verification.
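The collinearity screening in the correlation analysis above, which led to MT being dropped because of its high Spearman correlation with AT, can be sketched as follows. This is only an illustration on synthetic data: the helper name `collinear_pairs`, the 0.9 threshold, and the simulated values are assumptions and are not taken from the paper or its Table 1.

```python
import numpy as np
from scipy.stats import spearmanr

def collinear_pairs(data, names, threshold=0.9):
    """Return variable pairs whose absolute Spearman correlation exceeds threshold.

    data: (n_samples, n_vars) array with at least three columns, so that
    spearmanr returns a full correlation matrix.
    """
    rho, _ = spearmanr(data)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(rho[i, j]) > threshold:
                pairs.append((names[i], names[j], float(rho[i, j])))
    return pairs

# Synthetic stand-in data (hypothetical values, not the Xinjiang records).
rng = np.random.default_rng(0)
n = 100
at = rng.normal(20.0, 8.0, n)            # average temperature
mt = at + 6.0 + rng.normal(0.0, 1.0, n)  # a second temperature series tracking AT
arh = rng.normal(50.0, 10.0, n)          # relative humidity, independent of AT
aw = rng.normal(3.0, 1.0, n)             # wind speed, independent of AT

names = ["AT", "MT", "ARH", "AW"]
pairs = collinear_pairs(np.column_stack([at, mt, arh, aw]), names)
print(pairs)
```

In this synthetic setting only the AT and MT pair is flagged, so one of the two would be excluded before fitting, mirroring the choice made in the paper.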
By comparing Figs. 4 and 5, it can be seen that the estimated coefficients β(u, v) vary spatially across seasons. First, the incidence of HFMD in southeastern Xinjiang is higher than that in other regions in June 2018 (the first map of SCC in Figs. 4 and 5). Secondly, AT in June has a negative correlation with HFMD, and its effects fall into three main categories, among which it plays an important [...]
Observing Figs. 3 and 4, it is found that the spatial clustering pattern of the regression coefficients is highly similar to the spatial distribution characteristics of the meteorological factors, which indicates that the SCC method can accurately identify the real spatial characteristics of the regression coefficients studied in this paper.
From Fig. 6 and Table 3, it can be seen that the actual number of patients plotted against the value estimated by the SCC model falls along a 45° straight line, and the coefficient of determination of the SCC estimates is close to 1, indicating that the estimated and true values are almost the same and thus illustrating the accuracy of the SCC model estimation. To further illustrate this accuracy, this paper compares the SCC and GWR estimation results in Fig. 7. The MSE values of the SCC estimates are all close to 0 and lower than those of the GWR estimates.
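The two accuracy measures used in this comparison, the coefficient of determination and the mean squared error (MSE), can be computed as in the minimal sketch below. The function names and the toy numbers are illustrative assumptions; they are not the values behind Fig. 6, Fig. 7 or Table 3.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between observed and estimated case counts."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: three observed counts and three model estimates.
observed = [1.0, 2.0, 3.0]
estimated = [1.0, 2.0, 4.0]
print(mse(observed, estimated), r_squared(observed, estimated))
```

An estimate that reproduces the observations exactly gives an MSE of 0 and a coefficient of determination of 1, which is the limiting behaviour that a 45° line in a plot of estimated against actual values corresponds to.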

Discussion
Spatial epidemiology makes it possible to understand the spatial clustering of diseases and to explore the location and exact extent of clusters through spatial analysis models [20]. At present, many studies have carried out spatial correlation analysis on the number of people infected with various infectious diseases, but they have not considered the influence of other factors on the number of cases [21][22][23]. Moreover, there are few spatial analyses of the incidence of HFMD in Xinjiang, and relevant studies have focused on the analysis of epidemiological characteristics [24].
In this study, based on the dataset of 100 counties and cities in Xinjiang, Spatial Clustering Coefficient (SCC) regression is used to detect the spatial clustering patterns of the regression coefficients and to estimate their values, so as to infer how meteorological factors influence HFMD cases. The results show that the regression coefficients have different spatial clustering patterns in different seasons. In line with previous studies [25][26][27], the AT, ARH, AP and AW factors all have an impact on the number of HFMD cases in Xinjiang. Existing studies have shown that the incidence of HFMD in northern Xinjiang is higher because of its high population density, which is consistent with the results of this study. Secondly, the study finds that HFMD cases decline as AT rises in June. Some studies have shown that the influence of temperature on HFMD follows an inverted "U" shape: when the temperature is below a certain value, the incidence of HFMD increases with temperature, whereas above that value the risk of HFMD decreases as temperature increases [4,7]. Xinjiang is in midsummer in June, when the daytime average temperature is as high as 28 °C; this leads to fewer people going out and thus reduces the incidence of HFMD [1]. In contrast, AW has a positive impact on the number of HFMD cases. A possible explanation is that increased wind speed promotes the spread of the virus, and that higher wind speed brings cooler weather, which further encourages outdoor activities, so the incidence of HFMD increases. The regression coefficients of ARH have five spatial clustering patterns. With increasing ARH, the incidence of HFMD decreases fastest in Urumqi and its surroundings compared with other areas, while in Aral City the incidence of HFMD increases rapidly.
AP plays an important role in Xinshi District, Shayibak District, and Tianshan District, while it shows a strong negative correlation in Aksu District; a study in Guangdong revealed that for every 1 hPa increase in air pressure, the number of cases decreased by 6.8 [28]. This may be because lower air pressure may weaken the human immune system [29]. In early winter (November), similar to June, AT has a negative correlation with the incidence of HFMD and its regression coefficients form three clusters, but its influence is relatively weak, although it has strong explanatory power in Bayingoleng Autonomous Prefecture and other areas. The reason may be that as the temperature dropped, the virus began to multiply in large numbers and susceptible people gathered, which accelerated the spread of the virus [28]. The above results fully reflect the seasonal characteristics of enterovirus [6,9]. AW and HFMD cases show a negative correlation in Xinjiang. A systematic review of the association between ventilation and infection suggested that higher ventilation rates could decrease infection rates or outbreaks of some airborne diseases [30]. Hence, it is likely that strong natural ventilation could serve as a barrier to the spread of respiratory droplets. The regression coefficients of ARH have relatively complex spatial patterns, and ARH appears to have a different impact on the number of HFMD cases in each region, possibly because of the complex terrain of Xinjiang. According to previous research, on days with high relative humidity, infected people may excrete more enteroviruses into the environment [31], and these enteroviruses can easily attach to the surfaces of toys or to small particles in the air; this process may cause the virus to accumulate in the environment, thereby accelerating the spread of HFMD [32].
This is consistent with the results of this study in November (except for the northern, northwestern, and southwestern regions of Xinjiang). In summary, the spatial clustering patterns of the coefficients of AT, ARH, AP and AW in June and November in Xinjiang are detected using the SCC method, and the influence of meteorological factors on HFMD is analyzed. It is hoped that this will provide some theoretical basis for the prevention and control of HFMD in Xinjiang.

Methods
Model. Suppose the spatial data $(x(u_i, v_i), y(u_i, v_i))$, $i = 1, \dots, n$, are observed at the locations $(u_1, v_1), \dots, (u_n, v_n) \in \mathbb{R}^2$, where the response variable $y(u_i, v_i)$ is spatially correlated. Consider a spatially varying coefficient model
$$y(u_i, v_i) = \sum_{k=1}^{p} x_k(u_i, v_i)\,\beta_k(u_i, v_i) + \varepsilon(u_i, v_i), \quad i = 1, \dots, n. \tag{1}$$
Clearly, this problem needs to be regularized because there are more unknown parameters ($np$ coefficients) than observations ($n$). For spatial problems, the association between the response variable and the explanatory variables at nearby locations is expected to be highly homogeneous, which prompted us to assign $\beta$ a regularization function that reflects this spatial homogeneity.
Specifically, the paper estimates $\beta$ by minimizing the SCC objective
$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{n}\sum_{i=1}^{n}\Big(y(u_i, v_i) - \sum_{k=1}^{p} x_k(u_i, v_i)\,\beta_k(u_i, v_i)\Big)^2 + \lambda\sum_{k=1}^{p}\sum_{(i,j)\in E}\big|\beta_k(u_i, v_i) - \beta_k(u_j, v_j)\big|, \tag{2}$$
where $E$ is the edge set of a graph with $n$ vertices, each of which corresponds to an observation location, and $|\cdot|$ is a generalized lasso penalty that is applied whenever the locations $(u_i, v_i)$ and $(u_j, v_j)$ are connected by an edge in $E$, thus encouraging homogeneity between the two regression coefficients. $\lambda$ is a penalty parameter, which is selected using the BIC criterion in this example. Once an edge set $E$ is given, (2) is written in matrix form
$$\hat{\beta} = \arg\min_{\beta}\; \frac{1}{n}\Big\|y - \sum_{k=1}^{p}\mathrm{diag}(X_k)\,\beta_k\Big\|_2^2 + \lambda\sum_{k=1}^{p}\|H\beta_k\|_1, \tag{3}$$
where $y$, $X_k$ and $\beta_k$ are the $n$-vectors collecting $y(u_i, v_i)$, $x_k(u_i, v_i)$ and $\beta_k(u_i, v_i)$ over the $n$ locations, and $H$ is an $m \times n$ matrix constructed from the edge set $E$ with $m$ edges. For the edge connecting two locations $(u_i, v_i)$ and $(u_j, v_j)$, this paper denotes the penalty term $|\beta_k(u_i, v_i) - \beta_k(u_j, v_j)|$ as $|H_\ell \beta_k|$, where $H_\ell$ is the corresponding row vector of $H$, which contains only two non-zero elements: the $i$-th element is $1$ and the $j$-th element is $-1$. The edge set $E$ is a key component of the SCC model because it reflects prior assumptions about the structure of the regression coefficients. This paper constructs the edge set $E$ by generating a minimum spanning tree (MST).
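The construction of the edge set E from a minimum spanning tree and of the matrix H described above can be sketched as follows. This is a minimal illustration under stated assumptions: Euclidean distance between location coordinates is assumed as the edge weight (the paper does not state the weighting here), the function names `mst_edges` and `incidence_matrix` are hypothetical, and the five coordinates are made up.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_edges(coords):
    """Return the n-1 edges (i, j), i < j, of a minimum spanning tree over the locations.

    Assumes distinct locations: zero distances are treated as missing edges by SciPy.
    """
    dist = squareform(pdist(coords))            # dense n x n Euclidean distance matrix
    tree = minimum_spanning_tree(dist).tocoo()  # sparse matrix holding the tree edges
    return sorted((int(min(i, j)), int(max(i, j))) for i, j in zip(tree.row, tree.col))

def incidence_matrix(edges, n):
    """Build H: one row per edge, with +1 in column i and -1 in column j."""
    H = np.zeros((len(edges), n))
    for row, (i, j) in enumerate(edges):
        H[row, i] = 1.0
        H[row, j] = -1.0
    return H

# Five hypothetical locations forming two well-separated groups.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [10.0, 10.0], [12.0, 10.0]])
edges = mst_edges(coords)
H = incidence_matrix(edges, len(coords))
print(edges)
print(H)
```

Each row of H applied to a coefficient vector returns the difference between two neighbouring locations, so penalizing the absolute values of these products shrinks neighbouring coefficients toward each other. Because a spanning tree has exactly n − 1 edges and no cycles, H has full row rank.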
Once the MST is constructed, the resulting penalty no longer contains redundant terms, so the original problem can be easily transformed into a lasso or lasso-type problem after appropriate reparameterization. Define new parameters $\theta_k$, $k = 1, \dots, p$, as
$$\theta_k = \begin{pmatrix} H \\ \frac{1}{n}\mathbf{1}^{T} \end{pmatrix}\beta_k = \tilde{H}\beta_k,$$
where $\tilde{H}$ is an $n \times n$ matrix. Since the rows of $\tilde{H}$ have full rank, $\tilde{H}$ is invertible and there is a one-to-one transformation between $\beta_k$ and $\theta_k$. The new design matrix is written as $\tilde{X} = [\mathrm{diag}(X_1)\tilde{H}^{-1}, \dots, \mathrm{diag}(X_p)\tilde{H}^{-1}]$. Then, the matrix-form SCC model can be rewritten as
$$\hat{\theta} = \arg\min_{\theta}\; \frac{1}{n}\big\|y - \tilde{X}\theta\big\|_2^2 + \lambda\sum_{l \in B}|\theta_l|, \tag{4}$$
where $\theta = (\theta_1^{T}, \dots, \theta_p^{T})^{T}$ is a vector of length $np$, and $B = \{l : \mathrm{mod}(l, n) \neq 0,\ l = 1, \dots, np\}$ is the index set that excludes the $n$-th element of each $\theta_k$. For simplicity, the paper denotes $\sum_{l \in B}|\theta_l|$ by $\|\theta_B\|_1$. Therefore, by solving the lasso problem in (4) with respect to the parameter $\theta$, the solution of the SCC model (2) with lasso penalty can be obtained. The estimator of $\beta$ is given by $\hat{\beta}_k = \tilde{H}^{-1}\hat{\theta}_k$, $k = 1, \dots, p$.
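The reparameterization above can be illustrated with a small numerical sketch. Several choices here are assumptions rather than the authors' implementation: the squared-error term is scaled by 1/n, the lasso problem is solved with a plain proximal-gradient (ISTA) loop instead of a dedicated lasso solver, the function names are hypothetical, and the data are a noise-free toy example with a single constant covariate on a chain of ten locations.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal map of t * |.|."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_scc_lasso(y, X, H, lam, n_iter=5000):
    """Estimate spatially clustered coefficients via the reparameterized lasso.

    y: (n,) responses; X: (n, p) covariates; H: (n-1, n) incidence matrix of a
    spanning tree; lam: penalty parameter. Returns an (n, p) array whose
    column k holds the estimated beta_k at the n locations.
    """
    n, p = X.shape
    H_tilde = np.vstack([H, np.full((1, n), 1.0 / n)])  # append the averaging row
    H_inv = np.linalg.inv(H_tilde)
    # Column block k equals diag(X_k) @ H_inv.
    X_tilde = np.hstack([X[:, [k]] * H_inv for k in range(p)])
    # Index set B: penalize every entry except the n-th one of each theta_k.
    penalized = np.tile(np.arange(1, n + 1) % n != 0, p)
    # Step size 1/L, where L is the Lipschitz constant of the smooth part's gradient.
    step = n / (2.0 * np.linalg.norm(X_tilde, 2) ** 2)
    theta = np.zeros(n * p)
    for _ in range(n_iter):
        grad = -2.0 / n * X_tilde.T @ (y - X_tilde @ theta)
        z = theta - step * grad
        theta = np.where(penalized, soft_threshold(z, step * lam), z)
    return H_inv @ theta.reshape(p, n).T  # map theta back to beta

# Toy example: ten locations on a chain, one constant covariate,
# and a response with a single abrupt jump between two clusters.
n = 10
H = np.eye(n - 1, n) - np.eye(n - 1, n, k=1)  # chain-graph incidence matrix
X = np.ones((n, 1))
y = np.r_[np.zeros(5), 4.0 * np.ones(5)]
beta_hat = fit_scc_lasso(y, X, H, lam=0.1)
print(np.round(beta_hat[:, 0], 3))
```

In this toy case the estimate is piecewise constant with one jump, and each of the two levels is pulled slightly toward the other by the penalty. The ISTA loop is chosen only for transparency; it converges slowly when the design matrix is ill-conditioned, so a dedicated lasso solver would be preferable for data of realistic size.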